
<!DOCTYPE html>

<html lang="zh">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />

    <title>2.3 并行计算简介 &#8212; 深入浅出PyTorch</title>
    
  <!-- Loaded before other Sphinx assets -->
  <link href="../_static/styles/theme.css?digest=1999514e3f237ded88cf" rel="stylesheet">
<link href="../_static/styles/pydata-sphinx-theme.css?digest=1999514e3f237ded88cf" rel="stylesheet">

    
  <link rel="stylesheet"
    href="../_static/vendor/fontawesome/5.13.0/css/all.min.css">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">

    <link rel="stylesheet" type="text/css" href="../_static/pygments.css" />
    <link rel="stylesheet" href="../_static/styles/sphinx-book-theme.css?digest=62ba249389abaaa9ffc34bf36a076bdc1d65ee18" type="text/css" />
    <link rel="stylesheet" type="text/css" href="../_static/togglebutton.css" />
    <link rel="stylesheet" type="text/css" href="../_static/mystnb.css" />
    <link rel="stylesheet" type="text/css" href="../_static/plot_directive.css" />
    
  <!-- Pre-loaded scripts that we'll load fully later -->
  <link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf">

    <script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
    <script src="../_static/jquery.js"></script>
    <script src="../_static/underscore.js"></script>
    <script src="../_static/doctools.js"></script>
    <script>let toggleHintShow = 'Click to show';</script>
    <script>let toggleHintHide = 'Click to hide';</script>
    <script>let toggleOpenOnPrint = 'true';</script>
    <script src="../_static/togglebutton.js"></script>
    <script src="../_static/scripts/sphinx-book-theme.js?digest=f31d14ad54b65d19161ba51d4ffff3a77ae00456"></script>
    <script>var togglebuttonSelector = '.toggle, .admonition.dropdown, .tag_hide_input div.cell_input, .tag_hide-input div.cell_input, .tag_hide_output div.cell_output, .tag_hide-output div.cell_output, .tag_hide_cell.cell, .tag_hide-cell.cell';</script>
    <link rel="index" title="索引" href="../genindex.html" />
    <link rel="search" title="搜索" href="../search.html" />
    <link rel="next" title="AI硬件加速设备" href="2.4%20AI%E7%A1%AC%E4%BB%B6%E5%8A%A0%E9%80%9F%E8%AE%BE%E5%A4%87.html" />
    <link rel="prev" title="2.2 自动求导" href="2.2%20%E8%87%AA%E5%8A%A8%E6%B1%82%E5%AF%BC.html" />
    <meta name="docsearch:language" content="zh">
    

    <!-- Google Analytics -->
    
  </head>
  <body data-spy="scroll" data-target="#bd-toc-nav" data-offset="60">
<!-- Checkboxes to toggle the left sidebar -->
<input type="checkbox" class="sidebar-toggle" name="__navigation" id="__navigation" aria-label="Toggle navigation sidebar">
<label class="overlay overlay-navbar" for="__navigation">
    <div class="visually-hidden">Toggle navigation sidebar</div>
</label>
<!-- Checkboxes to toggle the in-page toc -->
<input type="checkbox" class="sidebar-toggle" name="__page-toc" id="__page-toc" aria-label="Toggle in-page Table of Contents">
<label class="overlay overlay-pagetoc" for="__page-toc">
    <div class="visually-hidden">Toggle in-page Table of Contents</div>
</label>
<!-- Headers at the top -->
<div class="announcement header-item noprint"></div>
<div class="header header-item noprint"></div>

    
    <div class="container-fluid" id="banner"></div>

    

    <div class="container-xl">
      <div class="row">
          
<!-- Sidebar -->
<div class="bd-sidebar noprint" id="site-navigation">
    <div class="bd-sidebar__content">
        <div class="bd-sidebar__top"><div class="navbar-brand-box">
    <a class="navbar-brand text-wrap" href="../index.html">
      
      
      
      <h1 class="site-logo" id="site-title">深入浅出PyTorch</h1>
      
    </a>
</div><form class="bd-search d-flex align-items-center" action="../search.html" method="get">
  <i class="icon fas fa-search"></i>
  <input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" >
</form><nav class="bd-links" id="bd-docs-nav" aria-label="Main">
    <div class="bd-toc-item active">
        <p aria-level="2" class="caption" role="heading">
 <span class="caption-text">
  目录
 </span>
</p>
<ul class="current nav bd-sidenav">
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E9%9B%B6%E7%AB%A0/index.html">
   第零章：前置知识
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/>
  <label for="toctree-checkbox-1">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E9%9B%B6%E7%AB%A0/0.1%20%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD%E7%AE%80%E5%8F%B2.html">
     人工智能简史
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E9%9B%B6%E7%AB%A0/0.2%20%E8%AF%84%E4%BB%B7%E6%8C%87%E6%A0%87.html">
     模型评价指标
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E9%9B%B6%E7%AB%A0/0.3%20%E5%B8%B8%E7%94%A8%E5%8C%85%E7%9A%84%E5%AD%A6%E4%B9%A0.html">
     常用包的学习
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E9%9B%B6%E7%AB%A0/0.4%20Jupyter%E7%9B%B8%E5%85%B3%E6%93%8D%E4%BD%9C.html">
     Jupyter notebook/Lab 简述
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E4%B8%80%E7%AB%A0/index.html">
   第一章：PyTorch的简介和安装
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-2" name="toctree-checkbox-2" type="checkbox"/>
  <label for="toctree-checkbox-2">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%80%E7%AB%A0/1.1%20PyTorch%E7%AE%80%E4%BB%8B.html">
     1.1 PyTorch简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%80%E7%AB%A0/1.2%20PyTorch%E7%9A%84%E5%AE%89%E8%A3%85.html">
     1.2 PyTorch的安装
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%80%E7%AB%A0/1.3%20PyTorch%E7%9B%B8%E5%85%B3%E8%B5%84%E6%BA%90.html">
     1.3 PyTorch相关资源
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 current active has-children">
  <a class="reference internal" href="index.html">
   第二章：PyTorch基础知识
  </a>
  <input checked="" class="toctree-checkbox" id="toctree-checkbox-3" name="toctree-checkbox-3" type="checkbox"/>
  <label for="toctree-checkbox-3">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul class="current">
   <li class="toctree-l2">
    <a class="reference internal" href="2.1%20%E5%BC%A0%E9%87%8F.html">
     2.1 张量
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="2.2%20%E8%87%AA%E5%8A%A8%E6%B1%82%E5%AF%BC.html">
     2.2 自动求导
    </a>
   </li>
   <li class="toctree-l2 current active">
    <a class="current reference internal" href="#">
     2.3 并行计算简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="2.4%20AI%E7%A1%AC%E4%BB%B6%E5%8A%A0%E9%80%9F%E8%AE%BE%E5%A4%87.html">
     AI硬件加速设备
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/index.html">
   第三章：PyTorch的主要组成模块
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/>
  <label for="toctree-checkbox-4">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.1%20%E6%80%9D%E8%80%83%EF%BC%9A%E5%AE%8C%E6%88%90%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E7%9A%84%E5%BF%85%E8%A6%81%E9%83%A8%E5%88%86.html">
     3.1 思考：完成深度学习的必要部分
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.2%20%E5%9F%BA%E6%9C%AC%E9%85%8D%E7%BD%AE.html">
     3.2 基本配置
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.3%20%E6%95%B0%E6%8D%AE%E8%AF%BB%E5%85%A5.html">
     3.3 数据读入
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.4%20%E6%A8%A1%E5%9E%8B%E6%9E%84%E5%BB%BA.html">
     3.4 模型构建
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.5%20%E6%A8%A1%E5%9E%8B%E5%88%9D%E5%A7%8B%E5%8C%96.html">
     3.5 模型初始化
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.6%20%E6%8D%9F%E5%A4%B1%E5%87%BD%E6%95%B0.html">
     3.6 损失函数
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.7%20%E8%AE%AD%E7%BB%83%E4%B8%8E%E8%AF%84%E4%BC%B0.html">
     3.7 训练和评估
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.8%20%E5%8F%AF%E8%A7%86%E5%8C%96.html">
     3.8 可视化
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%89%E7%AB%A0/3.9%20%E4%BC%98%E5%8C%96%E5%99%A8.html">
     3.9 PyTorch优化器
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E5%9B%9B%E7%AB%A0/index.html">
   第四章：PyTorch基础实战
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-5" name="toctree-checkbox-5" type="checkbox"/>
  <label for="toctree-checkbox-5">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%9B%9B%E7%AB%A0/4.1%20ResNet.html">
     4.1 ResNet
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%9B%9B%E7%AB%A0/4.4%20FashionMNIST%E5%9B%BE%E5%83%8F%E5%88%86%E7%B1%BB.html">
     基础实战——FashionMNIST时装分类
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E4%BA%94%E7%AB%A0/index.html">
   第五章：PyTorch模型定义
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-6" name="toctree-checkbox-6" type="checkbox"/>
  <label for="toctree-checkbox-6">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%BA%94%E7%AB%A0/5.1%20PyTorch%E6%A8%A1%E5%9E%8B%E5%AE%9A%E4%B9%89%E7%9A%84%E6%96%B9%E5%BC%8F.html">
     5.1 PyTorch模型定义的方式
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%BA%94%E7%AB%A0/5.2%20%E5%88%A9%E7%94%A8%E6%A8%A1%E5%9E%8B%E5%9D%97%E5%BF%AB%E9%80%9F%E6%90%AD%E5%BB%BA%E5%A4%8D%E6%9D%82%E7%BD%91%E7%BB%9C.html">
     5.2 利用模型块快速搭建复杂网络
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%BA%94%E7%AB%A0/5.3%20PyTorch%E4%BF%AE%E6%94%B9%E6%A8%A1%E5%9E%8B.html">
     5.3 PyTorch修改模型
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%BA%94%E7%AB%A0/5.4%20PyTorh%E6%A8%A1%E5%9E%8B%E4%BF%9D%E5%AD%98%E4%B8%8E%E8%AF%BB%E5%8F%96.html">
     5.4 PyTorch模型保存与读取
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/index.html">
   第六章：PyTorch进阶训练技巧
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-7" name="toctree-checkbox-7" type="checkbox"/>
  <label for="toctree-checkbox-7">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.1%20%E8%87%AA%E5%AE%9A%E4%B9%89%E6%8D%9F%E5%A4%B1%E5%87%BD%E6%95%B0.html">
     6.1 自定义损失函数
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.2%20%E5%8A%A8%E6%80%81%E8%B0%83%E6%95%B4%E5%AD%A6%E4%B9%A0%E7%8E%87.html">
     6.2 动态调整学习率
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.3%20%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83-torchvision.html">
     6.3 模型微调-torchvision
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.3%20%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83-timm.html">
     6.3 模型微调 - timm
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.4%20%E5%8D%8A%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83.html">
     6.4 半精度训练
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.5%20%E6%95%B0%E6%8D%AE%E5%A2%9E%E5%BC%BA-imgaug.html">
     6.5 数据增强-imgaug
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AD%E7%AB%A0/6.6%20%E4%BD%BF%E7%94%A8argparse%E8%BF%9B%E8%A1%8C%E8%B0%83%E5%8F%82.html">
     6.6 使用argparse进行调参
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E4%B8%83%E7%AB%A0/index.html">
   第七章：PyTorch可视化
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-8" name="toctree-checkbox-8" type="checkbox"/>
  <label for="toctree-checkbox-8">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%83%E7%AB%A0/7.1%20%E5%8F%AF%E8%A7%86%E5%8C%96%E7%BD%91%E7%BB%9C%E7%BB%93%E6%9E%84.html">
     7.1 可视化网络结构
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%83%E7%AB%A0/7.2%20CNN%E5%8D%B7%E7%A7%AF%E5%B1%82%E5%8F%AF%E8%A7%86%E5%8C%96.html">
     7.2 CNN可视化
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%83%E7%AB%A0/7.3%20%E4%BD%BF%E7%94%A8TensorBoard%E5%8F%AF%E8%A7%86%E5%8C%96%E8%AE%AD%E7%BB%83%E8%BF%87%E7%A8%8B.html">
     7.3 使用TensorBoard可视化训练过程
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B8%83%E7%AB%A0/7.4%20%E4%BD%BF%E7%94%A8wandb%E5%8F%AF%E8%A7%86%E5%8C%96%E8%AE%AD%E7%BB%83%E8%BF%87%E7%A8%8B.html">
     7.4 使用wandb可视化训练过程
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/index.html">
   第八章：PyTorch生态简介
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-9" name="toctree-checkbox-9" type="checkbox"/>
  <label for="toctree-checkbox-9">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/8.1%20%E6%9C%AC%E7%AB%A0%E7%AE%80%E4%BB%8B.html">
     8.1 本章简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/8.2%20%E5%9B%BE%E5%83%8F%20-%20torchvision.html">
     8.2 torchvision
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/8.3%20%E8%A7%86%E9%A2%91%20-%20PyTorchVideo.html">
     8.3 PyTorchVideo简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/8.4%20%E6%96%87%E6%9C%AC%20-%20torchtext.html">
     8.4 torchtext简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%85%AB%E7%AB%A0/8.5%20%E9%9F%B3%E9%A2%91%20-%20torchaudio.html">
     8.5 torchaudio简介
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E4%B9%9D%E7%AB%A0/index.html">
   第九章：PyTorch的模型部署
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-10" name="toctree-checkbox-10" type="checkbox"/>
  <label for="toctree-checkbox-10">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E4%B9%9D%E7%AB%A0/9.1%20%E4%BD%BF%E7%94%A8ONNX%E8%BF%9B%E8%A1%8C%E9%83%A8%E7%BD%B2%E5%B9%B6%E6%8E%A8%E7%90%86.html">
     9.1 使用ONNX进行部署并推理
    </a>
   </li>
  </ul>
 </li>
 <li class="toctree-l1 has-children">
  <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/index.html">
   第十章：常见代码解读
  </a>
  <input class="toctree-checkbox" id="toctree-checkbox-11" name="toctree-checkbox-11" type="checkbox"/>
  <label for="toctree-checkbox-11">
   <i class="fas fa-chevron-down">
   </i>
  </label>
  <ul>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/10.1%20%E5%9B%BE%E5%83%8F%E5%88%86%E7%B1%BB.html">
     10.1 图像分类简介（补充中）
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/10.2%20%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B.html">
     目标检测简介
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/10.3%20%E5%9B%BE%E5%83%8F%E5%88%86%E5%89%B2.html">
     10.3 图像分割简介（补充中）
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/ResNet%E6%BA%90%E7%A0%81%E8%A7%A3%E8%AF%BB.html">
     ResNet源码解读
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/RNN%E8%AF%A6%E8%A7%A3%E5%8F%8A%E5%85%B6%E5%AE%9E%E7%8E%B0.html">
     文章结构
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/LSTM%E8%A7%A3%E8%AF%BB%E5%8F%8A%E5%AE%9E%E6%88%98.html">
     文章结构
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/Transformer%20%E8%A7%A3%E8%AF%BB.html">
     Transformer 解读
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/ViT%E8%A7%A3%E8%AF%BB.html">
     ViT解读
    </a>
   </li>
   <li class="toctree-l2">
    <a class="reference internal" href="../%E7%AC%AC%E5%8D%81%E7%AB%A0/Swin-Transformer%E8%A7%A3%E8%AF%BB.html">
     Swin Transformer解读
    </a>
   </li>
  </ul>
 </li>
</ul>

    </div>
</nav></div>
        <div class="bd-sidebar__bottom">
             <!-- To handle the deprecated key -->
            
            <div class="navbar_extra_footer">
            Theme by the <a href="https://ebp.jupyterbook.org">Executable Book Project</a>
            </div>
            
        </div>
    </div>
    <div id="rtd-footer-container"></div>
</div>


          


          
<!-- A tiny helper pixel to detect if we've scrolled -->
<div class="sbt-scroll-pixel-helper"></div>
<!-- Main content -->
<div class="col py-0 content-container">
    
    <div class="header-article row sticky-top noprint">
        



<div class="col py-1 d-flex header-article-main">
    <div class="header-article__left">
        
        <label for="__navigation"
  class="headerbtn"
  data-toggle="tooltip"
data-placement="right"
title="Toggle navigation"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-bars"></i>
  </span>

</label>

        
    </div>
    <div class="header-article__right">
<button onclick="toggleFullScreen()"
  class="headerbtn"
  data-toggle="tooltip"
data-placement="bottom"
title="Fullscreen mode"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-expand"></i>
  </span>

</button>

<div class="menu-dropdown menu-dropdown-repository-buttons">
  <button class="headerbtn menu-dropdown__trigger"
      aria-label="Source repositories">
      <i class="fab fa-github"></i>
  </button>
  <div class="menu-dropdown__content">
    <ul>
      <li>
        <a href="https://github.com/datawhalechina/thorough-pytorch"
   class="headerbtn"
   data-toggle="tooltip"
data-placement="left"
title="Source repository"
>
  

<span class="headerbtn__icon-container">
  <i class="fab fa-github"></i>
  </span>
<span class="headerbtn__text-container">repository</span>
</a>

      </li>
      
      <li>
        <a href="https://github.com/datawhalechina/thorough-pytorch/issues/new?title=Issue%20on%20page%20%2F第二章/2.3 并行计算简介.html&body=Your%20issue%20content%20here."
   class="headerbtn"
   data-toggle="tooltip"
data-placement="left"
title="Open an issue"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-lightbulb"></i>
  </span>
<span class="headerbtn__text-container">open issue</span>
</a>

      </li>
      
      <li>
        <a href="https://github.com/datawhalechina/thorough-pytorch/edit/master/第二章/2.3 并行计算简介.md"
   class="headerbtn"
   data-toggle="tooltip"
data-placement="left"
title="Edit this page"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-pencil-alt"></i>
  </span>
<span class="headerbtn__text-container">suggest edit</span>
</a>

      </li>
      
    </ul>
  </div>
</div>

<div class="menu-dropdown menu-dropdown-download-buttons">
  <button class="headerbtn menu-dropdown__trigger"
      aria-label="Download this page">
      <i class="fas fa-download"></i>
  </button>
  <div class="menu-dropdown__content">
    <ul>
      <li>
        <a href="../_sources/第二章/2.3 并行计算简介.md.txt"
   class="headerbtn"
   data-toggle="tooltip"
data-placement="left"
title="Download source file"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-file"></i>
  </span>
<span class="headerbtn__text-container">.md</span>
</a>

      </li>
      
      <li>
        
<button onclick="printPdf(this)"
  class="headerbtn"
  data-toggle="tooltip"
data-placement="left"
title="Print to PDF"
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-file-pdf"></i>
  </span>
<span class="headerbtn__text-container">.pdf</span>
</button>

      </li>
      
    </ul>
  </div>
</div>
<label for="__page-toc"
  class="headerbtn headerbtn-page-toc"
  
>
  

<span class="headerbtn__icon-container">
  <i class="fas fa-list"></i>
  </span>

</label>

    </div>
</div>

<!-- Table of contents -->
<div class="col-md-3 bd-toc show noprint">
    <div class="tocsection onthispage pt-5 pb-3">
        <i class="fas fa-list"></i> Contents
    </div>
    <nav id="bd-toc-nav" aria-label="Page">
        <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id2">
   2.3.1  为什么要做并行计算
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#cuda">
   2.3.2  为什么需要CUDA
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id3">
   2.3.3  常见的并行的方法：
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#network-partitioning">
     网络结构分布到不同的设备中(Network partitioning)
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#layer-wise-partitioning">
     同一层的任务分布到不同数据中(Layer-wise partitioning)
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#data-parallelism">
     不同的数据分布到不同的设备中，执行相同的任务(Data parallelism)
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id4">
   2.3.4 使用CUDA加速训练
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id5">
     单卡训练
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id6">
     多卡训练
    </a>
    <ul class="nav section-nav flex-column">
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#dp">
       单机多卡DP
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#ddp">
       多机多卡DDP
      </a>
     </li>
    </ul>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#dp-ddp">
     DP 与 DDP 的优缺点
    </a>
    <ul class="nav section-nav flex-column">
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id7">
       DP 的优势
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id8">
       DP 的缺点
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id9">
       DDP的优势
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id10">
       DDP 的缺点
      </a>
     </li>
    </ul>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id11">
   参考资料：
  </a>
 </li>
</ul>

    </nav>
</div>
    </div>
    <div class="article row">
        <div class="col pl-md-3 pl-lg-5 content-container">
            <!-- Table of contents that is only displayed when printing the page -->
            <div id="jb-print-docs-body" class="onlyprint">
                <h1>2.3 并行计算简介</h1>
                <!-- Table of contents -->
                <div id="print-main-content">
                    <div id="jb-print-toc">
                        
                        <div>
                            <h2> Contents </h2>
                        </div>
                        <nav aria-label="Page">
                            <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id2">
   2.3.1  为什么要做并行计算
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#cuda">
   2.3.2  为什么需要CUDA
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id3">
   2.3.3  常见的并行的方法：
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#network-partitioning">
     网络结构分布到不同的设备中(Network partitioning)
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#layer-wise-partitioning">
     同一层的任务分布到不同数据中(Layer-wise partitioning)
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#data-parallelism">
     不同的数据分布到不同的设备中，执行相同的任务(Data parallelism)
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id4">
   2.3.4 使用CUDA加速训练
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id5">
     单卡训练
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id6">
     多卡训练
    </a>
    <ul class="nav section-nav flex-column">
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#dp">
       单机多卡DP
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#ddp">
       多机多卡DDP
      </a>
     </li>
    </ul>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#dp-ddp">
     DP 与 DDP 的优缺点
    </a>
    <ul class="nav section-nav flex-column">
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id7">
       DP 的优势
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id8">
       DP 的缺点
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id9">
       DDP的优势
      </a>
     </li>
     <li class="toc-h4 nav-item toc-entry">
      <a class="reference internal nav-link" href="#id10">
       DDP 的缺点
      </a>
     </li>
    </ul>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id11">
   参考资料：
  </a>
 </li>
</ul>

                        </nav>
                    </div>
                </div>
            </div>
            <main id="main-content" role="main">
                
              <div>
                
  <section class="tex2jax_ignore mathjax_ignore" id="id1">
<h1>2.3 并行计算简介<a class="headerlink" href="#id1" title="永久链接至标题">#</a></h1>
<p>在利用PyTorch做深度学习的过程中，可能会遇到数据量较大、无法在单块GPU上完成训练，或者需要提升计算速度的场景，这时就需要用到并行计算。学习本节内容前，请确保你至少安装了一块NVIDIA GPU并安装了相应的驱动。</p>
<p>经过本节的学习，你将收获：</p>
<ul class="simple">
<li><p>并行计算的简介</p></li>
<li><p>CUDA简介</p></li>
<li><p>并行计算的三种实现方式</p></li>
<li><p>使用CUDA加速训练</p></li>
</ul>
<section id="id2">
<h2>2.3.1  为什么要做并行计算<a class="headerlink" href="#id2" title="永久链接至标题">#</a></h2>
<p>深度学习的发展离不开算力的发展，GPU的出现让我们的模型可以训练得更快、更好。所以，如何充分利用GPU的性能来提高模型的学习效果，是我们必须要掌握的技能。这一节我们主要讲的就是PyTorch的并行计算：PyTorch可以在完成模型编写之后，让多个GPU参与训练，减少训练时间。你可以在命令行使用<code class="docutils literal notranslate"><span class="pre">nvidia-smi</span></code>命令来查看你的GPU信息和使用情况。</p>
</section>
<section id="cuda">
<h2>2.3.2  为什么需要CUDA<a class="headerlink" href="#cuda" title="永久链接至标题">#</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">CUDA</span></code>是NVIDIA提供的一种GPU并行计算框架。对GPU本身进行编程时，使用的是<code class="docutils literal notranslate"><span class="pre">CUDA</span></code>语言。但在我们使用PyTorch编写深度学习代码时，<code class="docutils literal notranslate"><span class="pre">CUDA</span></code>则是另一层含义：在PyTorch中使用<code class="docutils literal notranslate"><span class="pre">CUDA</span></code>，表示让我们的模型或者数据开始使用GPU进行计算。</p>
<p>在编写程序时，当我们使用了 <code class="docutils literal notranslate"><span class="pre">.cuda()</span></code> 方法，其功能是让我们的模型或者数据从CPU迁移到GPU（默认是0号GPU）上，从而通过GPU进行计算。</p>
<p>注：</p>
<ol>
<li><p>我们使用GPU时使用的是<code class="docutils literal notranslate"><span class="pre">.cuda()</span></code>而不是使用<code class="docutils literal notranslate"><span class="pre">.gpu()</span></code>。这是因为当前GPU的编程接口采用CUDA，但是市面上的GPU并不是都支持CUDA，只有部分NVIDIA的GPU才支持，AMD的GPU编程接口采用的是OpenCL，在现阶段PyTorch并不支持。</p></li>
<li><p>数据在GPU和CPU之间进行传递时会比较耗时，我们应当尽量避免数据的切换。</p></li>
<li><p>GPU运算很快，但对于非常简单的操作，把数据搬到GPU的开销可能超过计算本身带来的收益，这时我们应该尽量用CPU去完成。</p></li>
<li><p>当我们的服务器上有多个GPU时，我们应该指明使用的是哪一块GPU。如果不设置，<code class="docutils literal notranslate"><span class="pre">tensor.cuda()</span></code>方法会默认将tensor保存到第一块GPU上，等价于<code class="docutils literal notranslate"><span class="pre">tensor.cuda(0)</span></code>，这有可能导致爆出<code class="docutils literal notranslate"><span class="pre">out</span> <span class="pre">of</span> <span class="pre">memory</span></code>的错误。我们可以通过以下两种方式进行设置。</p>
<ol>
<li><div class="highlight-python notranslate"><div class="highlight"><pre><span></span> <span class="c1">#设置在文件最开始部分</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;CUDA_VISIBLE_DEVICES&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;2&quot;</span> <span class="c1"># 设置默认的显卡</span>
</pre></div>
</div>
</li>
<li><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span> <span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="m">0</span>,1 python train.py <span class="c1"># 使用0，1两块GPU</span>
</pre></div>
</div>
</li>
</ol>
</li>
</ol>
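<p>上面第4点的环境变量设置可以写成一段可运行的小示意（其中环境变量的值<code class="docutils literal notranslate"><span class="pre">&quot;0&quot;</span></code>只是演示用的假设值）。注意<code class="docutils literal notranslate"><span class="pre">CUDA_VISIBLE_DEVICES</span></code>需要在导入PyTorch之前设置才能生效；同时给出一种常见的设备选择写法，在没有GPU（甚至没有安装PyTorch）的环境下回退到CPU：</p>

```python
import os

# 必须在 import torch 之前设置，否则不会生效（"0" 为演示用的假设值）
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def pick_device():
    """若安装了PyTorch且GPU可用则返回 "cuda"，否则回退到 "cpu"。"""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

device = pick_device()
print(device)  # 在无GPU的机器上输出 "cpu"
```

<p>之后的代码就可以统一写成 <code class="docutils literal notranslate"><span class="pre">tensor.to(device)</span></code> 的形式，而不必在CPU/GPU两套代码之间切换。</p>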
</section>
<section id="id3">
<h2>2.3.3  常见的并行的方法：<a class="headerlink" href="#id3" title="永久链接至标题">#</a></h2>
<section id="network-partitioning">
<h3>网络结构分布到不同的设备中(Network partitioning)<a class="headerlink" href="#network-partitioning" title="永久链接至标题">#</a></h3>
<p>In the early days of model parallelism, this scheme was used a lot. The main idea is to split a model into its constituent parts and place the different parts on different GPUs, each computing its own task. The architecture looks like this:</p>
<p><img alt="模型并行.png" src="../_images/model_parllel.png" /></p>
<p>The problem here is that with the model's components on different GPUs, data must travel between GPUs at every step, so inter-GPU communication becomes critical, and such communication is hard to sustain in compute-intensive workloads. For that reason this approach has gradually fallen out of favor.</p>
</section>
<section id="layer-wise-partitioning">
<h3>Splitting a single layer&#8217;s work across devices (Layer-wise partitioning)<a class="headerlink" href="#layer-wise-partitioning" title="Permalink to this headline">#</a></h3>
<p>The second approach splits each layer itself, letting different GPUs train different slices of the same layer. The architecture looks like this:</p>
<p><img alt="拆分.png" src="../_images/split.png" /></p>
<p>This avoids shipping whole components between devices, but once training gets heavy and the synchronization burden grows, it runs into the same problem as the first approach.</p>
</section>
<section id="data-parallelism">
<h3>Distributing different data to different devices, each running the same task (Data parallelism)<a class="headerlink" href="#data-parallelism" title="Permalink to this headline">#</a></h3>
<p>The third approach is a bit different. Its logic is: do not split the model at all; during training, every device holds the complete model, and it is the input data that gets split. Splitting the data means that the same model trains on a different portion of the data on each GPU; after each portion has been processed, the outputs only need to be aggregated, and back-propagation then proceeds. The architecture looks like this:</p>
<p><img alt="数据并行.png" src="../_images/data_parllel.png" /></p>
<p>This approach resolves the communication problems the previous schemes ran into. <strong>Data parallelism</strong> is the mainstream approach today.</p>
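<p>The scatter&#8211;compute&#8211;gather flow of data parallelism can be sketched in plain Python. A toy model function stands in for the per-GPU model replicas; none of the names below are PyTorch APIs:</p>

```python
def scatter(batch, n_devices):
    """Split a batch into near-equal chunks, one per device (the scatter step)."""
    base, rem = divmod(len(batch), n_devices)
    chunks, start = [], 0
    for i in range(n_devices):
        size = base + (1 if i < rem else 0)
        chunks.append(batch[start:start + size])
        start += size
    return chunks

def data_parallel_forward(model_fn, batch, n_devices):
    """Run the same model on every chunk, then gather the outputs in order."""
    per_device = [[model_fn(x) for x in chunk] for chunk in scatter(batch, n_devices)]
    return [y for outputs in per_device for y in outputs]

print(data_parallel_forward(lambda x: 2 * x, [1, 2, 3, 4, 5], n_devices=2))  # [2, 4, 6, 8, 10]
```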
</section>
</section>
<section id="id4">
<h2>2.3.4 Accelerating training with CUDA<a class="headerlink" href="#id4" title="Permalink to this headline">#</a></h2>
<section id="id5">
<h3>Single-GPU training<a class="headerlink" href="#id5" title="Permalink to this headline">#</a></h3>
<p>In PyTorch, using CUDA is very simple: we only need to move the data and the model to the GPU explicitly with the <code class="docutils literal notranslate"><span class="pre">.cuda()</span></code> method to accelerate training, as follows:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Net</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span> <span class="c1"># explicitly move the model to CUDA</span>

<span class="k">for</span> <span class="n">image</span><span class="p">,</span><span class="n">label</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
    <span class="c1"># explicitly move the images and labels to CUDA</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">image</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span> 
    <span class="n">label</span> <span class="o">=</span> <span class="n">label</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
</pre></div>
</div>
</section>
<section id="id6">
<h3>Multi-GPU training<a class="headerlink" href="#id6" title="Permalink to this headline">#</a></h3>
<p>PyTorch provides two ways to train on multiple GPUs: <code class="docutils literal notranslate"><span class="pre">DataParallel</span></code> and <code class="docutils literal notranslate"><span class="pre">DistributedDataParallel</span></code> (abbreviated DP and DDP below). Of the two, the official recommendation is <code class="docutils literal notranslate"><span class="pre">DDP</span></code>, because it performs better. However, <code class="docutils literal notranslate"><span class="pre">DDP</span></code> is relatively complex to use, whereas <code class="docutils literal notranslate"><span class="pre">DP</span></code> only requires changing a few lines of code, so we introduce <code class="docutils literal notranslate"><span class="pre">DP</span></code> first and then <code class="docutils literal notranslate"><span class="pre">DDP</span></code>.</p>
<section id="dp">
<h4>Single-node multi-GPU: DP<a class="headerlink" href="#dp" title="Permalink to this headline">#</a></h4>
<p><img alt="DP.png" src="../_images/DP.png" /></p>
<p>Let us first look at single-node multi-GPU DP. It uses the data-parallelism strategy described above: the computation is divided into sub-tasks that run simultaneously on multiple GPUs. It relies mainly on the <code class="docutils literal notranslate"><span class="pre">nn.DataParallel</span></code> wrapper, which is very simple to use; in general, we only need to add a few lines of code:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Net</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span> <span class="c1"># explicitly move the model to CUDA</span>

<span class="k">if</span> <span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">device_count</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># more than one GPU available</span>
	<span class="n">model</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">DataParallel</span><span class="p">(</span><span class="n">model</span><span class="p">)</span> <span class="c1"># single-node multi-GPU DP training</span>
</pre></div>
</div>
<p>Beyond that, we can also specify which GPUs to use for parallel training, generally in one of two ways:</p>
<ul>
<li><p>pass the <code class="docutils literal notranslate"><span class="pre">device_ids</span></code> argument to <code class="docutils literal notranslate"><span class="pre">nn.DataParallel</span></code> to specify the GPU ids to use</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">DataParallel</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># train in parallel on GPUs 0 and 1</span>
</pre></div>
</div>
</li>
<li><p><strong>manually restrict which GPU devices are visible to the program</strong></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;CUDA_VISIBLE_DEVICES&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;1,2&quot;</span>
</pre></div>
</div>
</li>
</ul>
</section>
<section id="ddp">
<h4>Multi-node multi-GPU: DDP<a class="headerlink" href="#ddp" title="Permalink to this headline">#</a></h4>
<p><img alt="DDP.png" src="../_images/DDP.png" /></p>
<p>However, distributed multi-GPU training via DP easily causes load imbalance: the first GPU may use noticeably more memory, because outputs are gathered onto it by default. To address this, PyTorch also provides the <code class="docutils literal notranslate"><span class="pre">torch.nn.parallel.DistributedDataParallel</span></code> (DDP) method.</p>
<p>DDP starts one process per GPU. These processes are identical at the beginning (the model's initial parameters are the same, and each process owns its own optimizer), and during model updates the propagated gradients are also exactly the same, which guarantees that the model parameters on every GPU remain identical. As a result, the memory imbalance seen with <code class="docutils literal notranslate"><span class="pre">DataParallel</span></code> does not occur. The trade-off is that DDP is more involved to set up; the following introduces how to use multi-node multi-GPU DDP.</p>
<p>Before starting, it is worth getting familiar with a few concepts.</p>
<p><strong>Process-group concepts</strong></p>
<ul class="simple">
<li><p><strong>GROUP</strong>: the process group. By default there is only one group; one job is one group, i.e. one world. (When finer-grained communication is needed, the <code class="docutils literal notranslate"><span class="pre">new_group</span></code> interface can create new groups from subsets of the world, e.g. for collective communication.)</p></li>
<li><p><strong>WORLD_SIZE</strong>: the total number of processes globally. With one process per GPU, as here, this equals the total number of GPUs across all machines.</p></li>
<li><p><strong>RANK</strong>: the global index of a process, used for inter-process communication and for identifying a process's role; the host with rank = 0 is the master node. With one process per GPU, the rank also identifies which GPU, counted across all machines, the process drives.</p></li>
<li><p><strong>LOCAL_RANK</strong>: the GPU index of a process within its own node. It is not an explicit argument; <code class="docutils literal notranslate"><span class="pre">torch.distributed.launch</span></code> sets it internally. For example, in a multi-node run, rank = 3 with local_rank = 0 means that process uses the first GPU of its machine.</p></li>
</ul>
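<p>Under the one-process-per-GPU convention used here, these quantities are related by simple arithmetic. A small sketch (the function names are ours, for illustration only):</p>

```python
def global_rank(node_index, local_rank, nproc_per_node):
    """RANK of the process with the given LOCAL_RANK on the given node."""
    return node_index * nproc_per_node + local_rank

def world_size(n_nodes, nproc_per_node):
    """WORLD_SIZE: the total number of processes across all nodes."""
    return n_nodes * nproc_per_node

# 2 nodes with 4 GPUs each: local_rank 2 on node 1 is global rank 6 of 8
print(world_size(2, 4), global_rank(1, 2, 4))  # 8 6
```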
<p><strong>Basic DDP usage (code-writing workflow)</strong></p>
<ul class="simple">
<li><p>Before using any other function in the <code class="docutils literal notranslate"><span class="pre">distributed</span></code> package, <strong>initialize the process group</strong> with <code class="docutils literal notranslate"><span class="pre">init_process_group</span></code>, which also initializes the <code class="docutils literal notranslate"><span class="pre">distributed</span></code> package itself.</p></li>
<li><p>Create the <strong>distributed model</strong> with <code class="docutils literal notranslate"><span class="pre">torch.nn.parallel.DistributedDataParallel</span></code>: <code class="docutils literal notranslate"><span class="pre">DDP(model,</span> <span class="pre">device_ids=device_ids)</span></code></p></li>
<li><p>Create the <strong>DataLoader</strong> with a <code class="docutils literal notranslate"><span class="pre">torch.utils.data.distributed.DistributedSampler</span></code></p></li>
<li><p>Use the launcher <code class="docutils literal notranslate"><span class="pre">torch.distributed.launch</span></code> to run the script once on every host, starting training</p></li>
</ul>
<p>First, modify the code to accept a --local_rank argument:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">argparse</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&quot;--local_rank&quot;</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span> <span class="c1"># this argument is important</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
</pre></div>
</div>
<p>The local_rank argument can be understood as the GPU number that <code class="docutils literal notranslate"><span class="pre">torch.distributed.launch</span></code> hands to each process it creates for a GPU. It is supplied automatically by the launcher; <strong>you do not need to pass it manually on the command line.</strong></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">local_rank</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;LOCAL_RANK&quot;</span><span class="p">])</span> <span class="c1">#也可以自动获取</span>
</pre></div>
</div>
<p>Then add the following line before any GPU-related code. Without it, every process starts on GPU 0, i.e. the first device allowed by your <code class="docutils literal notranslate"><span class="pre">CUDA_VISIBLE_DEVICES</span></code> setting:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">set_device</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span><span class="p">)</span> <span class="c1"># 调整计算的位置</span>
</pre></div>
</div>
<p>Next we need to initialize the <code class="docutils literal notranslate"><span class="pre">backend</span></code>. PyTorch documents the following backends:</p>
<p><img alt="Backends.png" src="../_images/backends.png" /></p>
<p>As shown, <code class="docutils literal notranslate"><span class="pre">gloo</span></code>, <code class="docutils literal notranslate"><span class="pre">nccl</span></code> and <code class="docutils literal notranslate"><span class="pre">mpi</span></code> are provided. How should we choose between them? The official documentation gives the following advice:</p>
<ul class="simple">
<li><p>Rules of thumb</p>
<ul>
<li><p>For distributed computation on <code class="docutils literal notranslate"><span class="pre">cpu</span></code>, use <code class="docutils literal notranslate"><span class="pre">gloo</span></code>: as the table shows, <code class="docutils literal notranslate"><span class="pre">gloo</span></code> has the best CPU support.</p></li>
<li><p>For distributed computation on <code class="docutils literal notranslate"><span class="pre">gpu</span></code>, use <code class="docutils literal notranslate"><span class="pre">nccl</span></code>.</p></li>
</ul>
</li>
<li><p>GPU hosts</p>
<ul>
<li><p>With InfiniBand interconnect, use <code class="docutils literal notranslate"><span class="pre">nccl</span></code>: it is currently the only backend that supports InfiniBand and GPUDirect.</p></li>
<li><p>With Ethernet, use <code class="docutils literal notranslate"><span class="pre">nccl</span></code> as well: its distributed GPU training performance is currently the best, especially for multi-process single-node or multi-node training. If you run into any problems with <code class="docutils literal notranslate"><span class="pre">nccl</span></code>, fall back to <code class="docutils literal notranslate"><span class="pre">gloo</span></code>. (Note, though, that for GPUs <code class="docutils literal notranslate"><span class="pre">gloo</span></code> currently runs slower than <code class="docutils literal notranslate"><span class="pre">nccl</span></code>.)</p></li>
</ul>
</li>
<li><p>CPU hosts</p>
<ul>
<li><p>With InfiniBand, use <code class="docutils literal notranslate"><span class="pre">gloo</span></code> if IP over IB is enabled, otherwise use <code class="docutils literal notranslate"><span class="pre">mpi</span></code></p></li>
<li><p>With Ethernet, use <code class="docutils literal notranslate"><span class="pre">gloo</span></code> unless you have a compelling reason to use <code class="docutils literal notranslate"><span class="pre">mpi</span></code>.</p></li>
</ul>
</li>
</ul>
<p>Once the backend is chosen, we need to set the network interface: since multiple hosts communicate over the network, IP addresses and interfaces are involved. <code class="docutils literal notranslate"><span class="pre">nccl</span></code> and <code class="docutils literal notranslate"><span class="pre">gloo</span></code> usually find a network interface on their own, but on machines with several NICs you may need to set it yourself with the following code:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># pick one of the two: the first is needed for the gloo backend, the second for nccl</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;GLOO_SOCKET_IFNAME&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;eth0&#39;</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;NCCL_SOCKET_IFNAME&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;eth0&#39;</span>
</pre></div>
</div>
<blockquote>
<div><p>You can find your network interface by running <code class="docutils literal notranslate"><span class="pre">ifconfig</span></code> and locating the entry that carries your IP address; it is usually something like <code class="docutils literal notranslate"><span class="pre">em0</span></code>, <code class="docutils literal notranslate"><span class="pre">eth0</span></code> or <code class="docutils literal notranslate"><span class="pre">enp2s0</span></code>.</p>
</div></blockquote>
<p>From the above we can see that on GPUs <code class="docutils literal notranslate"><span class="pre">nccl</span></code> is more efficient than <code class="docutils literal notranslate"><span class="pre">gloo</span></code>, so we generally still choose the <code class="docutils literal notranslate"><span class="pre">nccl</span></code> backend. Set the backend used for inter-GPU communication and initialize the process group:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># P.S. check whether nccl is available:</span>
<span class="c1"># torch.distributed.is_nccl_available()</span>
<span class="n">torch</span><span class="o">.</span><span class="n">distributed</span><span class="o">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="s1">&#39;nccl&#39;</span><span class="p">)</span> <span class="c1"># choose the nccl backend and initialize the process group</span>
</pre></div>
</div>
<p>After that, partition the dataset with <code class="docutils literal notranslate"><span class="pre">DistributedSampler</span></code>. It helps us split each batch into several partitions; the current process only needs to fetch and train on the partition corresponding to its rank:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># create the DataLoader</span>
<span class="n">train_sampler</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">distributed</span><span class="o">.</span><span class="n">DistributedSampler</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span>
<span class="n">train_loader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">)</span>
</pre></div>
</div>
<p>Note: the test set does not need a sampler.</p>
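<p>The rank-based split that <code class="docutils literal notranslate"><span class="pre">DistributedSampler</span></code> performs can be imitated in a few lines of plain Python. The toy class below is ours (not part of PyTorch); it also illustrates why the real sampler exposes <code class="docutils literal notranslate"><span class="pre">set_epoch</span></code>: the shuffle is seeded by the epoch number, so calling it at the start of each epoch gives every epoch a different order:</p>

```python
import random

class ToySampler:
    """Toy imitation of DistributedSampler: each of num_replicas
    processes iterates over a disjoint slice of the shuffled indices."""
    def __init__(self, data_len, num_replicas, rank, seed=0):
        self.data_len = data_len
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch  # re-seeds the shuffle for the next epoch

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.data_len))
        rng.shuffle(indices)  # the same order in every process...
        return iter(indices[self.rank::self.num_replicas])  # ...but a disjoint slice

s0 = ToySampler(8, num_replicas=2, rank=0)
s1 = ToySampler(8, num_replicas=2, rank=1)
print(sorted(list(s0) + list(s1)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Together the two ranks cover the dataset exactly once per epoch, which is the property the real sampler guarantees.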
<p>Then wrap the model with <code class="docutils literal notranslate"><span class="pre">torch.nn.parallel.DistributedDataParallel</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># train with DDP</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">parallel</span><span class="o">.</span><span class="n">DistributedDataParallel</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span><span class="p">])</span>
</pre></div>
</div>
<p><strong>How to launch DDP</strong></p>
<p>Launching DDP differs from DP: it requires the torch.distributed.launch launcher. For the single-node multi-GPU case:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span> <span class="n">python</span> <span class="o">-</span><span class="n">m</span> <span class="n">torch</span><span class="o">.</span><span class="n">distributed</span><span class="o">.</span><span class="n">launch</span> <span class="o">--</span><span class="n">nproc_per_node</span><span class="o">=</span><span class="mi">4</span> <span class="n">main</span><span class="o">.</span><span class="n">py</span>
<span class="c1"># nproc_per_node: the number of GPUs on this server to use</span>
</pre></div>
</div>
<blockquote>
<div><p>Even in cases where DP would be simple enough, DDP is more efficient than DP, so for single-node multi-GPU training we often still use DDP.</p>
</div></blockquote>
</section>
</section>
<section id="dp-ddp">
<h3>Pros and cons of DP and DDP<a class="headerlink" href="#dp-ddp" title="Permalink to this headline">#</a></h3>
<section id="id7">
<h4>Advantages of DP<a class="headerlink" href="#id7" title="Permalink to this headline">#</a></h4>
<p><code class="docutils literal notranslate"><span class="pre">nn.DataParallel</span></code> does not change the model's inputs or outputs, so the rest of the code needs no changes at all; it is very convenient, a single line of code does the job.</p>
</section>
<section id="id8">
<h4>Disadvantages of DP<a class="headerlink" href="#id8" title="Permalink to this headline">#</a></h4>
<p>Distributed multi-GPU training with <code class="docutils literal notranslate"><span class="pre">DP</span></code> easily causes load imbalance: the first GPU uses more memory, because outputs are gathered onto it by default, which also means the subsequent loss computation happens only on <code class="docutils literal notranslate"><span class="pre">cuda:0</span></code> and cannot be parallelized.</p>
<p>Moreover, <code class="docutils literal notranslate"><span class="pre">DP</span></code> only works on a single machine, and it is implemented as a single process with multiple threads, which makes it less efficient than <code class="docutils literal notranslate"><span class="pre">DDP</span></code>'s multi-process design.</p>
</section>
<section id="id9">
<h4>Advantages of DDP<a class="headerlink" href="#id9" title="Permalink to this headline">#</a></h4>
<p><strong>1. Each process runs an independent training loop and exchanges only a small amount of data, such as gradients.</strong></p>
<p>With <strong><code class="docutils literal notranslate"><span class="pre">DDP</span></code></strong>, in every iteration each process has its own <code class="docutils literal notranslate"><span class="pre">optimizer</span></code> and performs all optimization steps independently; within a process, training is no different from ordinary training.</p>
<p>After each process has computed its gradients, the <strong>gradients</strong> are aggregated and averaged across processes; the <code class="docutils literal notranslate"><span class="pre">rank=0</span></code> process then <code class="docutils literal notranslate"><span class="pre">broadcast</span></code>s the result to all processes, and each process uses this gradient to update its parameters independently. <code class="docutils literal notranslate"><span class="pre">DP</span></code>, by contrast, <strong>gathers the gradients onto the main</strong> <code class="docutils literal notranslate"><span class="pre">GPU</span></code>, <strong>updates the parameters there via back-propagation</strong>, and then broadcasts the parameters to the other GPUs.</p>
<p>Since with <strong><code class="docutils literal notranslate"><span class="pre">DDP</span></code></strong> all processes start from the same initial parameters (a single <code class="docutils literal notranslate"><span class="pre">broadcast</span></code> at startup) and the gradient used for every update is also the same, the model parameters in all processes stay identical throughout.</p>
<p>In <code class="docutils literal notranslate"><span class="pre">DP</span></code>, on the other hand, a single <code class="docutils literal notranslate"><span class="pre">optimizer</span></code> is maintained for the whole run: gradients from all <code class="docutils literal notranslate"><span class="pre">GPU</span></code>s are summed, parameters are updated on the main <code class="docutils literal notranslate"><span class="pre">GPU</span></code>, and the model parameters are then <code class="docutils literal notranslate"><span class="pre">broadcast</span></code> to the other <code class="docutils literal notranslate"><span class="pre">GPU</span></code>s.</p>
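<p>The gradient-averaging step that keeps the DDP replicas in sync can be simulated in plain Python. This is a toy stand-in for the real all-reduce collective, for illustration only:</p>

```python
def all_reduce_mean(grads_per_process):
    """Toy all-reduce: every process ends up holding the element-wise
    average of all processes' gradients, as in DDP after backward()."""
    n = len(grads_per_process)
    dim = len(grads_per_process[0])
    avg = [sum(g[i] for g in grads_per_process) / n for i in range(dim)]
    return [avg[:] for _ in range(n)]  # each process gets its own copy

# two processes computed different gradients on their own data shards
print(all_reduce_mean([[1.0, 2.0], [3.0, 6.0]]))  # [[2.0, 4.0], [2.0, 4.0]]
```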
<p>Compared with <code class="docutils literal notranslate"><span class="pre">DP</span></code>, <code class="docutils literal notranslate"><span class="pre">DDP</span></code> transfers less data and is therefore faster and more efficient.</p>
<p><strong>2. Each process has its own interpreter and GIL.</strong></p>
<p>The commonly used <code class="docutils literal notranslate"><span class="pre">Python</span></code> interpreter, <code class="docutils literal notranslate"><span class="pre">CPython</span></code>, is implemented in <code class="docutils literal notranslate"><span class="pre">C</span></code> and is currently the most widely deployed. Its global interpreter lock (<code class="docutils literal notranslate"><span class="pre">Global</span> <span class="pre">Interpreter</span> <span class="pre">Lock</span></code>), the mechanism <code class="docutils literal notranslate"><span class="pre">Python</span></code> uses to synchronize threads, allows only one thread to execute at any moment, which makes <code class="docutils literal notranslate"><span class="pre">Python</span></code> perform poorly under multithreading.</p>
<p>Because every process has its own interpreter and <code class="docutils literal notranslate"><span class="pre">GIL</span></code>, the extra interpreter overhead and <code class="docutils literal notranslate"><span class="pre">GIL-thrashing</span></code> caused by driving multiple execution threads, model replicas or <code class="docutils literal notranslate"><span class="pre">GPU</span></code>s from a single <code class="docutils literal notranslate"><span class="pre">Python</span></code> process are eliminated, reducing contention on the interpreter and the <code class="docutils literal notranslate"><span class="pre">GIL</span></code>. This matters most for <code class="docutils literal notranslate"><span class="pre">models</span></code> that rely heavily on the <code class="docutils literal notranslate"><span class="pre">Python</span> <span class="pre">runtime</span></code>, e.g. those containing <code class="docutils literal notranslate"><span class="pre">RNN</span></code> layers or many small components.</p>
</section>
<section id="id10">
<h4>Disadvantages of DDP<a class="headerlink" href="#id10" title="Permalink to this headline">#</a></h4>
<p>For now, <code class="docutils literal notranslate"><span class="pre">DDP</span></code> is a multi-process approach and trains faster; its main drawback is that it requires modifying quite a lot of code, which is considerably more cumbersome than <code class="docutils literal notranslate"><span class="pre">DP</span></code>'s single line.</p>
</section>
</section>
</section>
<section id="id11">
<h2>References<a class="headerlink" href="#id11" title="Permalink to this headline">#</a></h2>
<ol class="simple">
<li><p><a class="reference external" href="https://blog.csdn.net/kuweicai/article/details/120516410">PyTorch parallel training (DP, DDP): principles and applications (in Chinese)</a></p></li>
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/447563272">Single-node multi-GPU distributed training in PyTorch (in Chinese)</a></p></li>
<li><p><a class="reference external" href="https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255">Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU &amp; Distributed setups</a></p></li>
<li><p><a class="reference external" href="https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel#torch.nn.parallel.DistributedDataParallel">DISTRIBUTEDDATAPARALLEL</a></p></li>
<li><p><a class="reference external" href="https://blog.csdn.net/ytusdc/article/details/122091284">PyTorch distributed training (DP/DDP) (in Chinese)</a></p></li>
<li><p><a class="reference external" href="https://pytorch.org/docs/stable/distributed.html">DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED</a></p></li>
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/86441879">Multi-GPU parallel training in PyTorch (in Chinese)</a></p></li>
</ol>
</section>
</section>


              </div>
              
            </main>
            <footer class="footer-article noprint">
                
            </footer>
        </div>
    </div>
    <div class="footer-content row">
        <footer class="col footer"><p>
  
    By ZhikangNiu<br/>
  
      &copy; Copyright 2022, ZhikangNiu.<br/>
</p>
        </footer>
    </div>
    
</div>


      </div>
    </div>
  
  <!-- Scripts loaded after <body> so the DOM is not blocked -->
  <script src="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf"></script>


  </body>
</html>